Cache various statistics to improve performance #204

JCGoran · 2024-07-21T14:06:49Z

An implementation that resolves #201.

Describe your changes

This is a rather large-looking PR (only because I am basing this off of #198 for ease-of-merge!), and is a sort of a first attempt (which is why I am marking it as a draft) at caching the various statistics properties as described in #201.

TL;DR: since at each iteration we only move one point in the dataset (basically map $(x_i, y_i) \mapsto (x'_i, y'_i)$ ), why not compute the new statistics ($\langle X' \rangle$, $\text{Var}(X') $, and $r(X', Y') $) in terms of the old ones ($\langle X \rangle$, $\text{Var}(X) $, and $r(X, Y) $) + the old ($(x_i, y_i) $) and new ($(x'_i, y'_i) $) points? This is essentially what this PR does.

Details

Mathematical derivation

The mean is shifted as:

$$ \langle X' \rangle = \langle X \rangle + \frac{\delta_x}{n} $$

where $\delta_x = x'_i - x_i$ (same for $Y$), the variance (or the standard deviation if you want) is shifted as:

$$ \text{Var}(X') = \text{Var}(X) + 2 \frac{\delta_x}{n} (x_i - \langle X \rangle) + \frac{\delta_x^2}{n} - \frac{\delta_x^2}{n^2} $$

while the correlation coefficient is shifted as:

$$ r(X', Y') = \frac{\langle X Y \rangle + \frac{1}{n}(\delta_x y_i + \delta_y x_i + \delta_x \delta_y) - \langle X' \rangle \langle Y' \rangle}{\sqrt{\text{Var}(X') \text{Var}(Y')}} $$

where the only new quantity to keep track of is $\langle X Y \rangle$ (the rest can be obtained from the above 2).

Workflow

With the math out of the way, this is how the new way of computing the statistics works:

read the (x, y) dataset, and make an instance of Statistics (subject to change for its somewhat ambiguous naming) from it
start the iteration
Statistics has a method, perturb, which takes in 3 arguments: the row (index) we wish to perturb, and the 2 values of the perturbations (in the x and y directions), and returns an instance of the new SummaryStatistics (i.e. the one from the perturbed data)
if _is_close_enough returns True, we call the perturb method again (it's $O(1)$ so not very expensive to call), but this time with update=True, which actually updates the data in the Statistics instance
go to the next iteration

Note that there's a bunch of new methods (shifted_mean, shifted_var, shifted_stdev, shifted_corrcoef), which we may or may not want to make private (or refactor so they accept different arguments), as I didn't put too much thought into the overall API design (again, hence the "draft" status).

Performance

Now, onto the performance results! Note that the benchmarks were done with #201 already in there, so the absolute numbers differ from the ones on main, but the reduction in the number of function calls is evident. I tested everything using python -m cProfile -m data_morph --seed 42 --start-shape panda --target-shape [shape].

# star shape - perf before
79257097 function calls (78610163 primitive calls) in 43.075 seconds
# star shape - perf after
54021781 function calls (52470658 primitive calls) in 28.704 seconds
# rings shape - perf before
94095798 function calls (93420792 primitive calls) in 43.998 seconds
# rings shape - perf after
68846080 function calls (67219947 primitive calls) in 29.592 seconds
# bullseye shape - perf before
82412199 function calls (81796202 primitive calls) in 40.245 seconds
# bullseye shape - perf after
57109757 function calls (55635954 primitive calls) in 26.200 seconds

so we use 26.1M less function calls for all shapes in question, which results in about 35% faster performance (computed as (t_old - t_new) / t_old) * 100).

Discussion items

Some possible points of discussion:

I didn't pay too much attention to Bessel's correction in the derivation of the formulas above, though I imagine the numerical difference should be mostly negligible since all of the starting datasets have > 100 points
API design (make shifted_* functions private? Could perturb do without an update argument? etc.)
the tests for shifted_* functions are a bit repetitive, and could take a while to run (I was initially very skeptical of the numerical stability, but it appears to be fine, though it could just be that I haven't tried enough scenarios to hit an instability)

Feedback is welcome :)

Checklist

Test cases have been modified/added to cover any code changes.
Docstrings have been modified/created for any code changes.
All linting and formatting checks pass (see the contributing guidelines for more information).

Order of operations was wrong before, causing incorrect outputs.

The implementation uses an AVL tree to keep track of the low and high parts of the input array, and then updates the trees accordingly in O(log n) time (instead of O(n) provided by numpy.median).

codecov · 2024-11-11T17:16:28Z

Codecov Report

Attention: Patch coverage is 96.47059% with 9 lines in your changes missing coverage. Please review.

Project coverage is 98.21%. Comparing base (5d0486b) to head (656c96d).

Files with missing lines	Patch %	Lines
src/data_morph/data/stats.py	91.34%	6 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #204      +/-   ##
==========================================
- Coverage   98.47%   98.21%   -0.27%     
==========================================
  Files          43       43              
  Lines        1839     2075     +236     
  Branches      114      129      +15     
==========================================
+ Hits         1811     2038     +227     
- Misses         25       31       +6     
- Partials        3        6       +3

Files with missing lines	Coverage Δ
src/data_morph/data/dataset.py	`73.75% <100.00%> (+0.67%)`	⬆️
src/data_morph/morpher.py	`100.00% <100.00%> (ø)`
src/data_morph/plotting/static.py	`100.00% <100.00%> (ø)`
tests/data/test_stats.py	`100.00% <100.00%> (ø)`
tests/test_morpher.py	`100.00% <100.00%> (ø)`
src/data_morph/data/stats.py	`91.66% <91.34%> (-8.34%)`	⬇️

JCGoran · 2024-11-12T12:04:38Z

Regarding the median: I was not satisfied with the performance of pure Python implementations of sorted containers which have $O(\log n)$ time complexity for the various ops (add, remove, search), so I decided to implement my own, written in C++ (since std::map is basically a sorted version of Python's dict, with the tradeoff that ops are $O(\log n)$ and not $O(1)$ ) with bindings to Python using nanobind.

JCGoran added 7 commits July 15, 2024 23:43

Refactor to use more numpy functions internally

298870b

Add shifted_* functions and Statistics class

a52b25d

Actually use the cached statistics

2a80597

Remove unused comment

58f6d7c

Use value_new instead of delta for shifted_var

9635da5

Fix bug in implementation of morph

c683af7

Order of operations was wrong before, causing incorrect outputs.

Updated all docstrings accordingly

448c845

github-actions bot added testing Relating to the testing suite shapes Work relating to shapes module data Work relating to data module plotting Work relating to plotting module morpher Work relating to morpher module labels Jul 21, 2024

JCGoran added 6 commits July 22, 2024 20:07

Add implementation of median

4f7882e

The implementation uses an AVL tree to keep track of the low and high parts of the input array, and then updates the trees accordingly in O(log n) time (instead of O(n) provided by numpy.median).

Merge branch 'main' into jelic/feature/cache_statistics

b1db8d7

Add median to cached statistics

fce745d

Small fixup for median

53078c5

Use custom implementation of SortedCounter

aef3325

Merge branch 'main' into jelic/feature/cache_statistics

1a3d8ad

github-actions bot added the packaging Relating to packaging label Nov 10, 2024

Update package name

186763d

Please the lint overlord

656c96d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache various statistics to improve performance #204

Cache various statistics to improve performance #204

JCGoran commented Jul 21, 2024

codecov bot commented Nov 11, 2024 •

edited

Loading

JCGoran commented Nov 12, 2024 •

edited

Loading

Cache various statistics to improve performance #204

Are you sure you want to change the base?

Cache various statistics to improve performance #204

Conversation

JCGoran commented Jul 21, 2024

Details

Mathematical derivation

Workflow

Performance

Discussion items

codecov bot commented Nov 11, 2024 • edited Loading

Codecov Report

JCGoran commented Nov 12, 2024 • edited Loading

codecov bot commented Nov 11, 2024 •

edited

Loading

JCGoran commented Nov 12, 2024 •

edited

Loading